185 research outputs found
Statistical analysis of the owl:sameAs network for aligning concepts in the linking open data cloud
The massively distributed publication of linked data has brought to the attention of scientific community the limitations of classic methods for achieving data integration and the opportunities of pushing the boundaries of the field by experimenting this collective enterprise that is the linking open data cloud. While reusing existing ontologies is the choice of preference, the exploitation of ontology alignments still is a required step for easing the burden of integrating heterogeneous data sets. Alignments, even between the most used vocabularies, is still poorly supported in systems nowadays whereas links between instances are the most widely used means for bridging the gap between different data sets. We provide in this paper an account of our statistical and qualitative analysis of the network of instance level equivalences in the Linking Open Data Cloud (i.e. the sameAs network) in order to automatically compute alignments at the conceptual level. Moreover, we explore the effect of ontological information when adopting classical Jaccard methods to the ontology alignment task. Automating such task will allow in fact to achieve a clearer conceptual description of the data at the cloud level, while improving the level of integration between datasets. <br/
ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution
Entity resolution (ER), an important and common data cleaning problem, is
about detecting data duplicate representations for the same external entities,
and merging them into single representations. Relatively recently, declarative
rules called matching dependencies (MDs) have been proposed for specifying
similarity conditions under which attribute values in database records are
merged. In this work we show the process and the benefits of integrating three
components of ER: (a) Classifiers for duplicate/non-duplicate record pairs
built using machine learning (ML) techniques, (b) MDs for supporting both the
blocking phase of ML and the merge itself; and (c) The use of the declarative
language LogiQL -an extended form of Datalog supported by the LogicBlox
platform- for data processing, and the specification and enforcement of MDs.Comment: To appear in Proc. SUM, 201
- …